Feature Selection: An Exploration of Algorithm Performance
Jason Case, Abhishek Dugar, Daniel Nkuah, Khoa Tran
2024-07-30
Introduction
What is feature selection? Why do we care?
- Feature selection is a crucial step in data preprocessing, especially in machine learning and statistical modeling. It involves selecting a subset of relevant features (variables, predictors) for building a model.
- It is important because it reduces the risk of overfitting and enhances the performance of machine learning models and pattern recognition systems.
Introduction: Early Techniques
Forward, backward, and stepwise variable selection in linear models
- The forward method starts with no variables in the model and adds them one at a time until no further improvement is shown.
- The backward method starts with all variables and iteratively removes the least significant one.
- Stepwise selection combines both approaches, iteratively adding and removing variables to optimize the model's performance.
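The three strategies share the same greedy skeleton. As an illustration, here is a minimal numpy sketch of forward selection for a linear model; the `forward_select` helper and its stopping tolerance are our own illustrative choices, not a reference implementation:

```python
import numpy as np

def forward_select(X, y, tol=0.01):
    """Forward selection: start with no variables, repeatedly add the
    one that most reduces the residual sum of squares (RSS), and stop
    when the best candidate's relative improvement falls below `tol`."""
    n, p = X.shape

    def rss(cols):
        # OLS fit with an intercept on the chosen columns.
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.sum((y - A @ beta) ** 2))

    selected, remaining = [], list(range(p))
    current = rss(selected)
    while remaining:
        best_rss, best_j = min((rss(selected + [j]), j) for j in remaining)
        if current - best_rss < tol * current:   # no real improvement left
            break
        selected.append(best_j)
        remaining.remove(best_j)
        current = best_rss
    return selected

# Toy data: only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=200)
sel = forward_select(X, y)
```

Backward elimination runs the same loop in reverse, starting from all columns and deleting the one whose removal hurts RSS least; stepwise alternates between the two moves.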
Introduction: Early Techniques
Univariate screening procedures (USP)
- USP represents an important milestone for feature selection.
- These methods involve evaluating each predictor variable individually to determine its relationship with the target variable. The process selects variables that meet a specific statistical threshold.
- While these methods are simple and quick to use, they often miss the complex connections between variables.
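A univariate screen can be written in a few lines. This toy version (our own illustration, not a specific published procedure) scores each feature by its absolute Pearson correlation with the target, which is exactly why it cannot see effects that only appear through combinations of features:

```python
import numpy as np

def univariate_screen(X, y, threshold=0.25):
    """Keep every feature whose |Pearson correlation| with the target
    exceeds `threshold`; each feature is judged in isolation."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.flatnonzero(np.abs(r) > threshold)

# Only column 1 is related to the target; the screen should find it.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 2 * X[:, 1] + rng.normal(size=300)
kept = univariate_screen(X, y)
```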
Introduction: Modern Techniques
Advanced feature selection methods
- Similarity-based approaches select the features that best preserve the pairwise similarity structure of the data.
- Information-theoretical-based approaches use concepts like entropy and mutual information to evaluate feature importance.
- Sparse-learning-based approaches focus on identifying a small, essential set of features by limiting the number of features selected.
- Statistical-based approaches utilize statistical tests and models to determine the significance of features for predicting the target variable.
Introduction: Modern Techniques
Classification of methods
- The filter model selects features based on the general properties of the training data, independent of any learning algorithm, making it computationally efficient for large datasets.
- The wrapper model, on the other hand, uses a specific learning algorithm to evaluate and determine which features to keep, often leading to better performance but at a higher computational cost.
- Embedded methods perform feature selection during the model training process, integrating selection directly with learning.
- Hybrid methods combine the best aspects of filter and wrapper methods to achieve optimal performance with manageable computational complexity.
Methods: Correlation-Based Feature Selection (CFS)
- Filter method
- Uses the correlation coefficient to measure each variable's relationship with the target independently.
- Features that are highly correlated with one another are considered redundant.
CFS’s feature subset evaluation function (merit) is:
\(M_s = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k - 1)\,\bar{r}_{ff}}}\)
where \(k\) is the number of features in the subset, \(\bar{r}_{cf}\) is the mean feature-class correlation, and \(\bar{r}_{ff}\) is the mean feature-feature correlation.
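Under this merit function, a subset of complementary features (high mean feature-class correlation, low mean feature-feature correlation) scores higher than a subset of mutually redundant ones. A small numpy sketch of the evaluation, illustrative only:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset:
    M_s = k * rbar_cf / sqrt(k + k*(k-1) * rbar_ff),
    where rbar_cf is the mean |feature-target| correlation and
    rbar_ff the mean |feature-feature| correlation within the subset."""
    k = len(subset)
    R = np.corrcoef(np.column_stack([X[:, subset], y]), rowvar=False)
    rbar_cf = np.abs(R[:-1, -1]).mean()            # feature-target column
    if k > 1:
        ff = R[:-1, :-1]
        rbar_ff = np.abs(ff[~np.eye(k, dtype=bool)]).mean()
    else:
        rbar_ff = 0.0
    return k * rbar_cf / np.sqrt(k + k * (k - 1) * rbar_ff)

# `a` and `c` carry independent signal; `b` is nearly a copy of `a`.
rng = np.random.default_rng(2)
a = rng.normal(size=500)
c = rng.normal(size=500)
b = a + 0.01 * rng.normal(size=500)
y = a + c
X = np.column_stack([a, b, c])
m_complementary = cfs_merit(X, y, [0, 2])   # a with c
m_redundant = cfs_merit(X, y, [0, 1])       # a with its near-duplicate
```

The complementary pair scores higher than the redundant pair, which is the behavior the merit function is designed to reward.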
Methods: Recursive Feature Elimination (RFE)
- Wrapper method.
- Removes variables iteratively.
Steps:
- Train the classifier.
- Compute the ranking criterion for all features.
- Remove the feature(s) with the smallest ranking values.
- Repeat until the desired number of features is selected.
- The optimal subset is the one that yields the highest accuracy.
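The loop above can be sketched with an ordinary least-squares fit standing in for the classifier and |coefficient| (on standardized inputs) as the ranking criterion; both of those choices are ours, for illustration:

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination: fit on the surviving features
    (standardized so coefficient magnitudes are comparable), drop the
    feature with the smallest |coefficient|, repeat until n_keep remain."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    cols = list(range(X.shape[1]))
    while len(cols) > n_keep:
        A = np.column_stack([np.ones(len(y)), Xs[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        # beta[0] is the intercept; rank only the feature coefficients.
        cols.pop(int(np.argmin(np.abs(beta[1:]))))
    return cols

# Toy data: only columns 0 and 2 matter; RFE should keep exactly those.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=300)
keep2 = rfe(X, y, 2)
```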
Methods: Least Absolute Shrinkage and Selection Operator (LASSO)
- Embedded method
- LASSO estimates a coefficient for each feature.
- Uses L1 regularization to shrink some coefficients exactly to zero, dropping those features from the model.
The LASSO estimate is defined by the solution to the \(\ell_1\)-constrained optimization problem:
minimize \(\frac{1}{n}\| Y - X\beta \|_2^2\) subject to \(\sum_{j=1}^{k} |\beta_j| \le t\)
where \(t\) is an upper bound on the sum of the absolute values of the coefficients.
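Equivalently, for a penalty \(\lambda\) matched to \(t\), LASSO can be solved in penalized (Lagrangian) form. A small sketch using proximal gradient descent (ISTA), where the soft-thresholding step is what zeroes out coefficients; this is an illustration of the mechanism, not the solver used in the analysis:

```python
import numpy as np

def lasso(X, y, lam, n_iter=2000):
    """LASSO via ISTA on the penalized form:
    minimize ||y - X b||_2^2 / (2n) + lam * sum_j |b_j|.
    The soft-threshold sets small coefficients exactly to zero,
    which is what performs the feature selection."""
    n, p = X.shape
    b = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant
    for _ in range(n_iter):
        b = b - step * (X.T @ (X @ b - y)) / n    # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # shrink
    return b

# Toy data: only columns 0 and 2 matter; the rest should be zeroed out.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=300)
b = lasso(X, y, lam=0.1)
```

The surviving coefficients are slightly shrunk toward zero relative to OLS; that bias is the price paid for the selection behavior.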
Methods: CFS & RFE
- Hybrid method.
- Combines filter and wrapper methods.
Step 1: Filtering using the correlation coefficient (i.e., trim the "low-hanging fruit": variables with little correlation to the target).
Step 2: Remove remaining variables iteratively using RFE.
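Putting the two stages together, a toy version of the hybrid (the `hybrid_select` helper, the OLS stand-in classifier, and the keep fraction are all our own illustrative choices):

```python
import numpy as np

def hybrid_select(X, y, keep_frac=0.8, n_final=2):
    """Two-stage hybrid: (1) a CFS-style filter cheaply drops the
    lowest-|correlation| features; (2) RFE then eliminates survivors
    one at a time using a linear model's coefficient magnitudes."""
    # Step 1: correlation filter.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = np.abs((Xc * yc[:, None]).sum(axis=0)
               / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)))
    n_keep = max(n_final, int(keep_frac * X.shape[1]))
    cols = list(np.argsort(r)[::-1][:n_keep])
    # Step 2: recursive elimination on the survivors.
    Xs = Xc / X.std(axis=0)
    while len(cols) > n_final:
        A = np.column_stack([np.ones(len(y)), Xs[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        cols.pop(int(np.argmin(np.abs(beta[1:]))))
    return sorted(int(c) for c in cols)

# Toy data: only columns 0 and 2 matter among ten candidates.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=300)
picked = hybrid_select(X, y)
```

The filter pass shrinks the pool before the expensive wrapper loop runs, which is the computational point of the hybrid design.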
Analysis: Spambase Dataset
- 4601 instances of emails
- 57 features for classification tasks
- Binary classification: email is spam (1) or not (0)
- 80/20 train/test split
- Increased the number of features by adding all \(\binom{57}{2} = 1596\) two-way interactions
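The interaction expansion can be sketched as follows; note that 57 original features plus 1596 pairwise products give the 1653 baseline variables reported in the results:

```python
import numpy as np
from itertools import combinations
from math import comb

def add_two_way_interactions(X):
    """Append a product column x_i * x_j for every unordered pair of
    original features."""
    pairs = [X[:, i] * X[:, j]
             for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + pairs)

# 57 original features yield C(57, 2) = 1596 interaction columns.
X = np.random.default_rng(6).normal(size=(10, 57))
Xi = add_two_way_interactions(X)
print(Xi.shape)  # (10, 1653)
```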
Analysis: Spambase Dataset
Figure 1. Frequency of spam targets.
Analysis: COVID-19 NLP Text Classification Dataset
- 45k tweets related to COVID-19, labeled for sentiment analysis.
- Five sentiment label classes, ranging from extremely positive to extremely negative
- Recoded to a binary classification task, positive (1) or negative (0)
- 33,444 (91%) training and 3,179 (9%) testing records after recoding
Analysis: COVID-19 NLP Text Classification Dataset
Figure 2. Visualization of common words.
Analysis: COVID-19 NLP Text Classification Dataset
Figure 3. Frequency of sentiment targets.
Analysis: COVID-19 NLP Text Classification Dataset
“Bag of words”
- Text data converted into a matrix of word frequencies
- Each row represents a document
- Each column represents a unique word from the entire corpus
- Large (several thousand variables), sparse (> 99% of values = 0) feature set
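A minimal bag-of-words construction (using simple whitespace tokenization; the real pipeline's tokenizer may differ):

```python
from collections import Counter

def bag_of_words(docs):
    """Build a document-term matrix: one row per document, one column
    per unique word in the corpus, entries = word counts."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

docs = ["Masks sold out", "Stores sold out of masks"]
vocab, M = bag_of_words(docs)
print(vocab)  # ['masks', 'of', 'out', 'sold', 'stores']
print(M)      # [[1, 0, 1, 1, 0], [1, 1, 1, 1, 1]]
```

With thousands of tweets the vocabulary grows into the thousands while most entries stay zero, which is why the resulting matrix is so sparse.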
Analysis: COVID-19 NLP Text Classification Dataset
Figure 4. Distribution of words per Tweet.
Statistical Modeling
Three metrics:
- Accuracy on the test set
- Difference between accuracy on the training and test sets (overfitting)
- Number of variables selected (model complexity)
Five models:
- Baseline: full logistic regression model with no feature selection
- CFS: select the best of 20 correlation thresholds using cross-validation
- RFE: select the best of 20 subset sizes using cross-validation
- LASSO: select the best penalty term using cross-validation
- CFS + RFE: remove the 20% of variables with the lowest correlation, then select the best of 20 subset sizes using cross-validation
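The cross-validated threshold search (the CFS model above) can be sketched like this, with a least-squares scorer standing in for the logistic classifier; all helper names and the scorer choice are illustrative:

```python
import numpy as np

def kfold_accuracy(X, y, cols, k=5):
    """Mean k-fold accuracy of a least-squares scorer (predict class 1
    when the fitted value exceeds 0.5) restricted to columns `cols`."""
    idx = np.arange(len(y))
    accs = []
    for f in np.array_split(idx, k):
        tr = np.setdiff1d(idx, f)
        A = np.column_stack([np.ones(len(tr)), X[np.ix_(tr, cols)]])
        beta, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        B = np.column_stack([np.ones(len(f)), X[np.ix_(f, cols)]])
        accs.append(np.mean((B @ beta > 0.5) == y[f]))
    return float(np.mean(accs))

def best_threshold(X, y, thresholds):
    """CFS-style model selection: for each candidate threshold, keep
    the features whose |corr| with the target beats it, then pick the
    threshold whose surviving subset cross-validates best."""
    r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return max(thresholds,
               key=lambda t: kfold_accuracy(X, y, np.flatnonzero(r > t)))

# Toy binary task driven by column 0 only.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(float)
acc0 = kfold_accuracy(X, y, [0])
t_best = best_threshold(X, y, [0.1, 0.5, 0.9])
```

The RFE and CFS + RFE models follow the same pattern, sweeping 20 subset sizes instead of 20 correlation thresholds.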
Statistical Modeling
Figure 5. Example of cross-validation accuracy with CFS model for the Sentiment task.
Statistical Modeling
Figure 6. Example of coefficient behavior with LASSO model for the spam task.
Results: Spambase Dataset
Table 1. Results of spam classification task.
| Model | Variables selected | Test accuracy | Train-test accuracy gap |
|---|---|---|---|
| Baseline | 1653 | 0.915 | 0.070 |
| CFS | 126 | 0.936 | -0.004 |
| RFE | 1144 | 0.911 | 0.067 |
| LASSO | 50 | 0.903 | -0.006 |
| CFS + RFE | 683 | 0.920 | 0.045 |
Results: COVID-19 NLP Text Classification Dataset
Table 2. Results of sentiment classification task.
| Model | Variables selected | Test accuracy | Train-test accuracy gap |
|---|---|---|---|
| Baseline | 4820 | 0.846 | 0.097 |
| CFS | 1449 | 0.868 | 0.042 |
| RFE | 2980 | 0.863 | 0.059 |
| LASSO | 4472 | 0.880 | 0.045 |
| CFS + RFE | 2672 | 0.778 | 0.088 |
Results: Observations per Feature
- Decreased the ratio of observations per feature
- Selected 10% of records for the spam task
- Increased the number of features for the sentiment task by decreasing the threshold for "rare" words from 10 to 2
- Ran the analysis on the new datasets
- Reported overfitting as a function of observations per feature
Results: Observations per Feature
Figure 7. Plot of overfitting as a function of observations per feature.